My ML Project

Authors
Affiliation

Name I, First Name I

Name of the University

Name II, First Name II

Published

May 13, 2024

Abstract

The following machine learning project focuses on…

Warning: package 'ggplot2' was built under R version 4.3.1
Warning: package 'dplyr' was built under R version 4.3.1
Warning: package 'patchwork' was built under R version 4.3.1
Warning: package 'maps' was built under R version 4.3.1
Warning: package 'scales' was built under R version 4.3.1
Warning: package 'ggmap' was built under R version 4.3.1
Warning: package 'knitr' was built under R version 4.3.1
Warning: package 'rmarkdown' was built under R version 4.3.1
Warning: package 'lattice' was built under R version 4.3.1
Warning: package 'plotly' was built under R version 4.3.1
Warning: package 'leaflet' was built under R version 4.3.1

1 Introduction

1.1 Overview and Motivation

1.1.1 Context and Background

The Swiss real estate market, characterized by its resilience and complexity, presents a significant opportunity for advanced analytical approaches to understand pricing dynamics. This project, undertaken as part of a Master’s degree in Machine Learning at the University of Lausanne, aims to harness the power of data science to predict real estate market prices in Switzerland. Utilizing contemporary machine learning techniques within this academic framework not only enhances the learning experience but also contributes to a practical understanding of real estate valuation.

As housing prices continue to fluctuate amid economic uncertainties, such as interest rate changes and demographic shifts Credi Suisse, this investigation is not only timely but also of significant importance to potential investors, policymakers, and the academic community.

1.1.2 Aim Of The Investigation

The primary objective of this study is to predict Swiss real estate market prices using real-time data scraped from ImmoScout24, a prominent Swiss real estate website. This study addresses the significant question of

  • How can machine learning models utilize real-time data scraped from online real estate platforms to predict price trends in the Swiss real estate market?
  • How can machine learning models predict the sale prices of real estate properties in Switzerland based on current market data?

The relevance of this investigation is underpinned by the substantial financial implications of real estate investments and the benefit of predictive insights for both buyers and sellers in the market. The relevance of this study is underscored by the potential insights it could offer, where real estate plays a pivotal role in financial stability and growth.

1.1.3 Description of the Data

The data for this study consists of a meticulously compiled dataset from ImmoScout24, featuring a wide array of variables related to property listings across Switzerland. Fields in the dataset include price, number of rooms, square meters, address, canton, property type, floor, and year of construction. These data points have been gathered through a robust scraping algorithm designed to collect a comprehensive snapshot of the current market. This dataset provides a granular view of the market, essential for training robust machine learning models.

1.1.4 Methodology

This project employs model-based machine learning techniques to quantify the impact of various factors on property prices in Switzerland. The methodology involves training predictive models to interpret the complex relationships within the data, providing a statistical basis for price prediction. This approach allows for an examination of both linear dependencies and more intricate interactions, such as how location and property type combine to influence pricing.

1.1.5 Structure of the Report

The report is structured as follows to provide a coherent narrative and logical flow of analysis:

  • Section 1: Introduction - Outlines the research context, objectives, and significance.
  • Section 2: Data - Details the sources, nature, and preprocessing of the data used.
  • Section 3: Exploratory Data Analysis (EDA) - Analyzes the data to uncover patterns and anomalies.
  • Section 4: Unsupervised Learning - Applies clustering techniques to understand market segmentation.
  • Section 5: Supervised Learning - Discusses the development and validation of predictive models.
  • Section 6: Conclusion - Summarizes the findings, discusses the implications, and suggests areas for further research.

2 Data

  • Sources
  • Description
  • Wrangling/cleaning
  • Spotting mistakes and missing data (could be part of EDA too)
  • Listing anomalies and outliers (could be part of EDA too)

2.1 Wrangling and Cleaning

  • ajouter source
  • ajouter description
  • expliquer blabla
  • Explain why we remove NA from m2 column.
  • Explain …

2.1.1 Raw dataset

Code
properties <- read.csv(file.path(here(),"data/properties.csv"))
# show 1000 first rows of properties using reactable
reactable(head(properties, 1000))
Code

# Create a tibble with cantons and observations
observations_table <- tibble(
  Canton = c("Vaud", "Bern", "Lucerne", "Zurich", "Uri", "Schwyz",
             "Obwalden", "Nidwalden", "Glarus", "St. Gallen", "Grisons", 
             "Aargau", "Thurgau", "Ticino", "Valais", "Neuchatel", 
             "Geneva", "Jura", "Zug", "Fribourg", "Solothurn", 
             "Basel-Stadt", "Basel-Landschaft", "Schaffhausen", 
             "Appenzell-Ausser-Rhoden", "Appenzell-Inner-Rhoden", "Total"),
  Observations = c(3232, 1553, 376, 1191, 71, 93, 29, 51, 55, 757, 405,
                   1481, 553, 4230, 3601, 513, 629, 329, 69, 1242, 590, 
                   149, 705, 118, 102, 12, sum(c(3232, 1553, 376, 1191, 71, 93, 29, 51, 55, 757, 405,
                                               1481, 553, 4230, 3601, 513, 629, 329, 69, 1242, 590, 
                                               149, 705, 118, 102, 12)))
)

# Display the table using kable and kableExtra
observations_table %>%
  kbl(caption = "Number of Observations by Canton") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover")) %>%
  add_header_above(c(" " = 1, "Observations" = 1)) # Adds headers spanning columns
Number of Observations by Canton
Observations
Canton Observations
Vaud 3232
Bern 1553
Lucerne 376
Zurich 1191
Uri 71
Schwyz 93
Obwalden 29
Nidwalden 51
Glarus 55
St. Gallen 757
Grisons 405
Aargau 1481
Thurgau 553
Ticino 4230
Valais 3601
Neuchatel 513
Geneva 629
Jura 329
Zug 69
Fribourg 1242
Solothurn 590
Basel-Stadt 149
Basel-Landschaft 705
Schaffhausen 118
Appenzell-Ausser-Rhoden 102
Appenzell-Inner-Rhoden 12
Total 22136

2.1.2 Cleaning

Code
# Identify values causing the issue
problematic_values <- properties$number_of_rooms[is.na(as.numeric(properties$number_of_rooms))]
#> Warning: NAs introduced by coercion
# Replace non-numeric values with NA
#properties$number_of_rooms <- as.numeric(gsub("[^0-9.]", "", properties$number_of_rooms))

# Remove non-numeric characters and convert to numeric
properties$price <- as.numeric(gsub("[^0-9]", "", properties$price))

# Subset the dataset to exclude rows with price < 20000
properties <- properties[properties$price >= 20000, ]

# Subset the dataset to exclude rows with numbers of rooms < 25
#properties <- properties[properties$number_of_rooms <25, ]

# Replace incomplete addresses
properties$address <- gsub("^\\W*[.,0-]\\W*", "", properties$address)

properties_filtered <- na.omit(properties)

properties_filtered$year_category <- substr(properties_filtered$year_category, 1, 9)
# Assuming 'year_category' is a column in the 'properties' dataset
properties_filtered$year_category <- as.factor(properties_filtered$year_category)


# remove m^2 from column 'square_meters'
properties_filtered$square_meters <- as.numeric(gsub("\\D", "", properties_filtered$square_meters))
# print how many NA observations left in square_meters
print(sum(is.na(properties_filtered$square_meters)))
#> [1] 1056
# remove NA
properties_filtered <- properties_filtered[!is.na(properties_filtered$square_meters),]
# add majuscule to canton
properties_filtered$canton <- tools::toTitleCase(properties_filtered$canton)

# # Preprocess the number_of_rooms column
properties_filtered$number_of_rooms <- gsub(" rooms", "", properties_filtered$number_of_rooms)
properties_filtered$number_of_rooms <- gsub(" room", "", properties_filtered$number_of_rooms)
# Remove rows with "m²" in the "number_of_rooms" column
properties_filtered <- properties_filtered[!grepl("m²", properties_filtered$number_of_rooms), ]
properties_filtered$number_of_rooms <- as.numeric(properties_filtered$number_of_rooms)
# Remove rows with rooms >= 100
properties_filtered <- properties_filtered[properties_filtered$number_of_rooms <= 100, , drop = FALSE]
# Divide cells by 10 if number of rooms is more than 27
properties_filtered$number_of_rooms <- ifelse(properties_filtered$number_of_rooms > 27,
                                               properties_filtered$number_of_rooms / 10,
                                               properties_filtered$number_of_rooms)





# properties_filtered$number_of_rooms <- as.character(properties_filtered$number_of_rooms)
# properties_filtered$number_of_rooms <- gsub("\\D", properties_filtered$number_of_rooms)  # Remove non-numeric characters
# properties_filtered$number_of_rooms <- as.numeric(properties_filtered$number_of_rooms)       # Convert to numeric
# properties_filtered$number_of_rooms <- trunc(properties_filtered$number_of_rooms)             # Truncate non-integer values

# show 100 first row of cleaned dataset using reactable
reactable(head(properties_filtered, 100))
Code

#filter properties_filtered to contains only 'price', 'number_of_rooms', 'square_meters'
properties_summary <- properties_filtered[, c('price', 'number_of_rooms', 'square_meters')]
#summary statistics
summary(properties_summary)
#>      price          number_of_rooms square_meters 
#>  Min.   :   25000   Min.   : 1.00   Min.   :   1  
#>  1st Qu.:  690000   1st Qu.: 3.50   1st Qu.:  99  
#>  Median :  992340   Median : 4.50   Median : 136  
#>  Mean   : 1351000   Mean   : 4.98   Mean   : 160  
#>  3rd Qu.: 1550000   3rd Qu.: 5.50   3rd Qu.: 190  
#>  Max.   :26149500   Max.   :27.00   Max.   :2700
# Data
data <- data.frame(
  price = c(25000, 690000, 992340, 1348429, 1550000, 26149500),
  number_of_rooms = c(1.0, 35.0, 45.0, 41.1, 55.0, 185.0),
  square_meters = c(1, 98, 136, 159, 190, 2000)
)

# Summary statistics
summary_stats <- summary(data)

# Summary statistics
summary_stats <- cbind(
  Min = apply(data, 2, min),
  Q1 = apply(data, 2, quantile, probs = 0.25),
  Median = apply(data, 2, median),
  Mean = apply(data, 2, mean),
  Q3 = apply(data, 2, quantile, probs = 0.75),
  Max = apply(data, 2, max)
)

# Create table
table <- kbl(summary_stats, align = rep('c', 6), caption = "Summary statistics for the dataset") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "hover", "condensed", "responsive"))
table
Summary statistics for the dataset
Min Q1 Median Mean Q3 Max
price 25000 765585.0 1170384 5.13e+06 1.50e+06 26149500
number_of_rooms 1 36.5 43 6.04e+01 5.25e+01 185
square_meters 1 107.5 148 4.31e+02 1.82e+02 2000

2.1.3 AMTOVZ_CSV_LV95 Data

The dataset described is the “Official Index of Localities” (Répertoire officiel des localités) provided by the Swiss Federal Office of Topography (swisstopo). It contains comprehensive information about all localities in Switzerland and the Principality of Liechtenstein, including their names, postal codes, and perimeters.

This dataset is regularly updated on a monthly basis, incorporating changes reported by cantonal authorities and Swiss Post. It serves various purposes, including spatial analysis, integration with other geographic datasets, usage as a geolocated background in GIS (Geographic Information Systems) and CAD (Computer-Aided Design) systems, statistical analysis, and as a reference dataset for information systems.

Updates and release notes for the dataset are provided periodically, detailing changes and improvements made over time. The Swiss Federal Office of Topography manages and distributes this dataset as part of its responsibilities in collecting and providing official geospatial data for Switzerland.

Source - swisstopo

2.1.3.1 Creating Variable zip_code and merging with AMTOVZ_CSV_LV95

Code
df <- properties_filtered
#the address column is like : '1844 Villeneuve VD' and has zip code number in it
#taking out the zip code number and creating a new column 'zip_code'
#the way to identify the zip code is to identify numbers that are 4 digits long
df$zip_code <- as.numeric(gsub("\\D", "", df$address))
#removing the first two number of zip code has more than 4 number
df$zip_code <- ifelse(df$zip_code > 9999, df$zip_code %% 10000, df$zip_code)

2.1.3.2 Using AMTOVZ_CSV_LV95 to get the city and canton from the zip code

Code
#read .csv AMTOVZ_CSV_LV95
amto <- read.csv(file.path(here(),"data/AMTOVZ_CSV_WGS84.csv"), sep = ";")
#creating a new dataframe with 'Ortschaftsname' as 'City'Place_name', 'PLZ' as 'zip_code', 'KantonskÃ.rzel' as 'Canton_code', 'E' as 'lon' and 'N' as 'lat'
amto_df <- amto[, c('Gemeindename', 'PLZ', 'Kantonskürzel', 'E', 'N')]
#renaming the columns
colnames(amto_df) <- c('Community', 'zip_code', 'Canton_code', 'lon', 'lat')

#remove duplicates of zip code
amto_df <- amto_df[!duplicated(amto_df$zip_code),]

#add the variable of amto_df to the df if the zip code matches
df <- merge(df, amto_df, by = "zip_code", all.x = TRUE)

#check if there are nan in city
df[is.na(df$Community),]
#>       zip_code    price number_of_rooms square_meters
#> 1           25  2200000            10.0           263
#> 2           25  2200000             6.5           165
#> 3           26   655000             3.5            66
#> 4           26  1995000             7.5           180
#> 5          322   870000             2.5            59
#> 6          322   880000             2.5            55
#> 7          322   975000             3.5            56
#> 230       1014  1510000             5.5           146
#> 1137      1200 16092000             7.0           400
#> 1138      1200   679000             5.5           142
#> 1139      1200  3285450             5.0           230
#> 5481      1919  2558620             5.5           270
#> 5482      1919  1908000             6.5           210
#> 5483      1919  1065000             4.5           130
#> 5484      1919   785000             3.5           103
#> 7624      2500  1100000             5.0           154
#> 7625      2500   872500             4.5           144
#> 7626      2500   420000             4.5           115
#> 7627      2500  1450000             5.5           198
#> 7628      2500   885500             5.5           130
#> 7629      2500   872500             4.5           138
#> 7630      2500   892500             4.5           144
#> 7631      2500   885500             5.5           130
#> 7632      2500   887500             5.5           130
#> 7633      2500  1050000             4.5           121
#> 7634      2500   877500             4.5           138
#> 7635      2500   870500             4.5           125
#> 7636      2500   887500             4.5           144
#> 8328      3000   820000             5.5           165
#> 8329      3000  1140000             3.5           115
#> 8330      3000  1090000             3.5           115
#> 8331      3000  1090000             5.5           193
#> 8332      3000  1090000             5.5           193
#> 8333      3000   720000             3.5           102
#> 8334      3000   920000             4.5           157
#> 8335      3000   920000             4.5           157
#> 8336      3000  1590000             5.5           330
#> 10437     4000   975000             4.5           125
#> 10438     4000   180000             3.0            70
#> 10439     4000  2100000             6.5           360
#> 12362     5201   725000             3.5            95
#> 13215     6000   695000             4.5           133
#> 13968     6511   440000             2.0            64
#> 14244     6547 15000000             7.5           220
#> 14562     6602  2800000             6.5           250
#> 14563     6602  2800000             7.5           242
#> 14564     6602   450000             3.5            75
#> 14565     6602   270000             1.5            28
#> 14566     6604   760000             3.5            78
#> 14567     6604  1990000             4.5           220
#> 14568     6604  2668590             5.5           290
#> 16581     6901  3660930             4.5           290
#> 16582     6901  3660930             4.5           290
#> 16583     6903   790000             3.5           105
#> 16584     6907   995000             4.5           114
#> 16585     6907   995000             4.5           114
#> 16586     6911   469350             5.5           140
#> 16587     6911   737550             4.5            82
#> 16588     6911   660000             7.5           200
#> 16589     6911   610000             3.5           103
#> 17900     7133  2266290             5.5           160
#> 17909     7135  2690000             8.5           236
#> 18169     8000  2100000             4.5           152
#> 18170     8000  1650000             4.5           142
#> 18171     8000   925000             3.5           102
#> 18172     8000  1650000             4.5           142
#> 18173     8000  1150000             4.5           128
#> 18174     8000  1450000             5.5           143
#> 18175     8000  1990000             5.5           200
#> 18176     8000   975000             4.5           122
#> 18177     8000  1990000             5.5           200
#> 18178     8000  2495000             5.5           482
#> 18658     8238   245000             2.0            49
#> 19082     8423  2110000             6.5           204
#> 19083     8423  2190000             5.5           167
#> 20296     9241   545000             4.5           100
#> 20297     9241   730840             5.5           130
#>                                                  address
#> 1                                       1000 Lausanne 25
#> 2                                       1000 Lausanne 25
#> 3                                       1000 Lausanne 26
#> 4                          Lausanne 26, 1000 Lausanne 26
#> 5                    Via Cuolm Liung 30d, 7032 Laax GR 2
#> 6                                         7032 Laax GR 2
#> 7                       Via Murschetg 29, 7032 Laax GR 2
#> 230                                        1014 Lausanne
#> 1137                                         1200 Genève
#> 1138  Chemin des pralets, 74100 Etrembières, 1200 Genève
#> 1139                                         1200 Genève
#> 5481                                       1919 Martigny
#> 5482                                       1919 Martigny
#> 5483                                       1919 Martigny
#> 5484                                       1919 Martigny
#> 7624                                    2500 Biel/Bienne
#> 7625                                    2500 Biel/Bienne
#> 7626                                    2500 Biel/Bienne
#> 7627                                         2500 Bienne
#> 7628                                    2500 Biel/Bienne
#> 7629                                    2500 Biel/Bienne
#> 7630                                    2500 Biel/Bienne
#> 7631                                    2500 Biel/Bienne
#> 7632                                    2500 Biel/Bienne
#> 7633                     Hohlenweg 11b, 2500 Biel/Bienne
#> 7634                                    2500 Biel/Bienne
#> 7635                                    2500 Biel/Bienne
#> 7636                                    2500 Biel/Bienne
#> 8328                                           3000 Bern
#> 8329                                           3000 Bern
#> 8330                                           3000 Bern
#> 8331                                           3000 Bern
#> 8332                                           3000 Bern
#> 8333                                           3000 Bern
#> 8334                                           3000 Bern
#> 8335                                           3000 Bern
#> 8336                                           3000 Bern
#> 10437                                         4000 Basel
#> 10438           Lörrach Brombach Steinsack 6, 4000 Basel
#> 10439                                         4000 Basel
#> 12362                                      5201 Brugg AG
#> 13215   in TRIENGEN, ca. 20 min. bei Luzern, 6000 Luzern
#> 13968                                     6511 Cadenazzo
#> 14244                               Augio 1F, 6547 Augio
#> 14562                                       6602 Muralto
#> 14563                                       6602 Muralto
#> 14564                      Via Bacilieri 2, 6602 Muralto
#> 14565                                       6602 Muralto
#> 14566                                       6604 Locarno
#> 14567                                       6604 Solduno
#> 14568                                       6604 Solduno
#> 16581                                        6901 Lugano
#> 16582                                        6901 Lugano
#> 16583                                        6903 Lugano
#> 16584                                      6907 MASSAGNO
#> 16585                                      6907 MASSAGNO
#> 16586                             6911 Campione d'Italia
#> 16587                             6911 Campione d'Italia
#> 16588                             6911 Campione d'Italia
#> 16589                             6911 Campione d'Italia
#> 17900                  Inder Platenga 34, 7133 Obersaxen
#> 17909                                       7135 Fideris
#> 18169                                        8000 Zürich
#> 18170                                        8000 Zürich
#> 18171                                        8000 Zürich
#> 18172                                        8000 Zürich
#> 18173                                        8000 Zürich
#> 18174                                        8000 Zürich
#> 18175                                        8000 Zürich
#> 18176                                        8000 Zürich
#> 18177                                        8000 Zürich
#> 18178                                        8000 Zürich
#> 18658      Stemmerstrasse 14, 8238 Büsingen am Hochrhein
#> 19082                      Chüngstrasse 60, 8423 Embrach
#> 19083                      Chüngstrasse 48, 8423 Embrach
#> 20296                                       9241 Kradolf
#> 20297                                       9241 Kradolf
#>             canton    property_type floor year_category Community
#> 1             Vaud     Single house           1919-1945      <NA>
#> 2             Vaud            Villa           2006-2010      <NA>
#> 3             Vaud        Apartment noteg     2016-2024      <NA>
#> 4             Vaud            Villa           1961-1970      <NA>
#> 5          Grisons        Apartment    eg     2016-2024      <NA>
#> 6          Grisons        Apartment noteg     2016-2024      <NA>
#> 7          Grisons        Apartment noteg     2011-2015      <NA>
#> 230           Vaud        Apartment    eg     2011-2015      <NA>
#> 1137        Geneva     Single house           2011-2015      <NA>
#> 1138        Geneva Bifamiliar house           2016-2024      <NA>
#> 1139        Geneva Bifamiliar house           1981-1990      <NA>
#> 5481        Valais       Attic flat noteg     2016-2024      <NA>
#> 5482        Valais        Apartment noteg     2016-2024      <NA>
#> 5483        Valais        Apartment noteg     2016-2024      <NA>
#> 5484        Valais        Apartment noteg     2016-2024      <NA>
#> 7624          Bern     Single house           2001-2005      <NA>
#> 7625          Bern            Villa           2016-2024      <NA>
#> 7626          Bern        Apartment noteg     1971-1980      <NA>
#> 7627          Bern     Single house           2016-2024      <NA>
#> 7628          Bern            Villa           2016-2024      <NA>
#> 7629          Bern     Single house           2016-2024      <NA>
#> 7630          Bern     Single house           2016-2024      <NA>
#> 7631          Bern     Single house           2016-2024      <NA>
#> 7632          Bern     Single house           2016-2024      <NA>
#> 7633          Bern     Single house           2001-2005      <NA>
#> 7634          Bern     Single house           2016-2024      <NA>
#> 7635          Bern     Single house           2016-2024      <NA>
#> 7636          Bern     Single house           2016-2024      <NA>
#> 8328          Bern        Apartment noteg     2016-2024      <NA>
#> 8329          Bern        Apartment    eg     2016-2024      <NA>
#> 8330          Bern        Apartment    eg     2016-2024      <NA>
#> 8331          Bern        Roof flat noteg     2016-2024      <NA>
#> 8332          Bern        Apartment noteg     2016-2024      <NA>
#> 8333          Bern        Apartment    eg     2016-2024      <NA>
#> 8334          Bern           Duplex noteg     2016-2024      <NA>
#> 8335          Bern        Apartment noteg     2016-2024      <NA>
#> 8336          Bern        Apartment noteg     1991-2000      <NA>
#> 10437  Basel-Stadt     Single house           2016-2024      <NA>
#> 10438  Basel-Stadt     Single house           1961-1970      <NA>
#> 10439  Basel-Stadt            Villa           2016-2024      <NA>
#> 12362       Aargau        Apartment noteg     2016-2024      <NA>
#> 13215      Lucerne        Apartment noteg     1991-2000      <NA>
#> 13968       Ticino        Apartment noteg     2016-2024      <NA>
#> 14244      Grisons     Single house           2016-2024      <NA>
#> 14562       Ticino     Single house           1981-1990      <NA>
#> 14563       Ticino     Single house           1981-1990      <NA>
#> 14564       Ticino        Apartment noteg     1946-1960      <NA>
#> 14565       Ticino        Apartment    eg     1961-1970      <NA>
#> 14566       Ticino        Apartment noteg     2011-2015      <NA>
#> 14567       Ticino       Attic flat noteg     2011-2015      <NA>
#> 14568       Ticino        Apartment noteg     2011-2015      <NA>
#> 16581       Ticino       Attic flat noteg     2011-2015      <NA>
#> 16582       Ticino        Apartment noteg     2011-2015      <NA>
#> 16583       Ticino        Apartment noteg     2006-2010      <NA>
#> 16584       Ticino        Apartment noteg     2016-2024      <NA>
#> 16585       Ticino        Apartment noteg     2016-2024      <NA>
#> 16586       Ticino        Apartment noteg     1946-1960      <NA>
#> 16587       Ticino        Apartment noteg     1991-2000      <NA>
#> 16588       Ticino     Single house           1971-1980      <NA>
#> 16589       Ticino        Apartment    eg     1946-1960      <NA>
#> 17900      Grisons     Single house           2006-2010      <NA>
#> 17909      Grisons     Single house              0-1919      <NA>
#> 18169       Zurich        Apartment noteg     2016-2024      <NA>
#> 18170       Zurich       Attic flat noteg     2016-2024      <NA>
#> 18171       Zurich        Apartment noteg     2016-2024      <NA>
#> 18172       Zurich        Apartment noteg     2016-2024      <NA>
#> 18173       Zurich        Apartment noteg     2016-2024      <NA>
#> 18174       Zurich        Apartment    eg     2016-2024      <NA>
#> 18175       Zurich        Apartment noteg     2006-2010      <NA>
#> 18176       Zurich     Single house           2016-2024      <NA>
#> 18177       Zurich       Attic flat noteg     2006-2010      <NA>
#> 18178       Zurich        Apartment noteg        0-1919      <NA>
#> 18658 Schaffhausen        Apartment noteg     1961-1970      <NA>
#> 19082       Zurich Bifamiliar house           2016-2024      <NA>
#> 19083       Zurich     Single house           2016-2024      <NA>
#> 20296      Thurgau        Apartment noteg     1991-2000      <NA>
#> 20297      Thurgau        Apartment noteg     1991-2000      <NA>
#>       Canton_code lon lat
#> 1            <NA>  NA  NA
#> 2            <NA>  NA  NA
#> 3            <NA>  NA  NA
#> 4            <NA>  NA  NA
#> 5            <NA>  NA  NA
#> 6            <NA>  NA  NA
#> 7            <NA>  NA  NA
#> 230          <NA>  NA  NA
#> 1137         <NA>  NA  NA
#> 1138         <NA>  NA  NA
#> 1139         <NA>  NA  NA
#> 5481         <NA>  NA  NA
#> 5482         <NA>  NA  NA
#> 5483         <NA>  NA  NA
#> 5484         <NA>  NA  NA
#> 7624         <NA>  NA  NA
#> 7625         <NA>  NA  NA
#> 7626         <NA>  NA  NA
#> 7627         <NA>  NA  NA
#> 7628         <NA>  NA  NA
#> 7629         <NA>  NA  NA
#> 7630         <NA>  NA  NA
#> 7631         <NA>  NA  NA
#> 7632         <NA>  NA  NA
#> 7633         <NA>  NA  NA
#> 7634         <NA>  NA  NA
#> 7635         <NA>  NA  NA
#> 7636         <NA>  NA  NA
#> 8328         <NA>  NA  NA
#> 8329         <NA>  NA  NA
#> 8330         <NA>  NA  NA
#> 8331         <NA>  NA  NA
#> 8332         <NA>  NA  NA
#> 8333         <NA>  NA  NA
#> 8334         <NA>  NA  NA
#> 8335         <NA>  NA  NA
#> 8336         <NA>  NA  NA
#> 10437        <NA>  NA  NA
#> 10438        <NA>  NA  NA
#> 10439        <NA>  NA  NA
#> 12362        <NA>  NA  NA
#> 13215        <NA>  NA  NA
#> 13968        <NA>  NA  NA
#> 14244        <NA>  NA  NA
#> 14562        <NA>  NA  NA
#> 14563        <NA>  NA  NA
#> 14564        <NA>  NA  NA
#> 14565        <NA>  NA  NA
#> 14566        <NA>  NA  NA
#> 14567        <NA>  NA  NA
#> 14568        <NA>  NA  NA
#> 16581        <NA>  NA  NA
#> 16582        <NA>  NA  NA
#> 16583        <NA>  NA  NA
#> 16584        <NA>  NA  NA
#> 16585        <NA>  NA  NA
#> 16586        <NA>  NA  NA
#> 16587        <NA>  NA  NA
#> 16588        <NA>  NA  NA
#> 16589        <NA>  NA  NA
#> 17900        <NA>  NA  NA
#> 17909        <NA>  NA  NA
#> 18169        <NA>  NA  NA
#> 18170        <NA>  NA  NA
#> 18171        <NA>  NA  NA
#> 18172        <NA>  NA  NA
#> 18173        <NA>  NA  NA
#> 18174        <NA>  NA  NA
#> 18175        <NA>  NA  NA
#> 18176        <NA>  NA  NA
#> 18177        <NA>  NA  NA
#> 18178        <NA>  NA  NA
#> 18658        <NA>  NA  NA
#> 19082        <NA>  NA  NA
#> 19083        <NA>  NA  NA
#> 20296        <NA>  NA  NA
#> 20297        <NA>  NA  NA

We have 77 NAN, where

  • The zip code was not found in the atmo df
  • The zip code was incorectly isolated from the address

Removed them ::: {.cell layout-align=“center”}

Code
#remove the rows with nan in city
properties_filtered <- df[!is.na(df$Community),]
reactable(head(properties_filtered, 100))

:::

2.1.4 Tax data

  • source https://swisstaxcalculator.estv.admin.ch/#/taxdata/tax-rates
  • ajouter description
  • expliquer blabla

2.1.4.1 Cleaning

Code
# read csv
impots <- read.csv(file.path(here(),"data/estv_income_rates.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Remove 1st row
impots <- impots[-1, ]
# Remove 3rd column
impots <- impots[, -3]

# Combine text for columns 4-8
impots[1, 4:8] <- "Impôt sur le revenu"
# Combine text for columns 9-13
impots[1, 9:13] <- "Impôt sur la fortune"
# Combine text for columns 14-16
impots[1, 14:16] <- "Impôt sur le bénéfice"
# Combine text for columns 17-19
impots[1, 17:19] <- "Impôt sur le capital"

# Combine content of the first 2 rows into the 2nd row
impots[2, ] <- apply(impots[1:2, ], 2, function(x) paste(ifelse(is.na(x[1]), x[2], ifelse(is.na(x[2]), x[1], paste(x[1], x[2], sep = " ")))))

# Remove 1st row
impots <- impots[-1, ]

# Assign the text to the 1st row and 1st column
impots[1, 1] <- "Coefficient d'impôt en %"
# Replace column names with the content of the first row
colnames(impots) <- impots[1, ]
impots <- impots[-1, ]

# Check for missing values in impots
any_missing <- any(is.na(impots))

if (any_missing) {
  print("There are missing values in impots.")
} else {
  print("There are no missing values in impots.")
}
#> [1] "There are no missing values in impots."


# Replace row names with the content of the 3rd column
row.names(impots) <- impots[, 3]
impots <- impots[, -3]

# Remove 2nd column (to avoid canton column)
impots <- impots[, -2]
# Remove impot eglise
impots <- impots[, -c(4:6)]
impots <- impots[, -c(6:8)]
impots <- impots[, -8]
impots <- impots[, -10]
# Clean data and convert to numeric
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))

# Replace NA values with 0
cleaned_impots[is.na(cleaned_impots)] <- 0

# Check for non-numeric values
non_numeric <- sum(!is.na(cleaned_impots) & !is.numeric(cleaned_impots))
if (non_numeric > 0) {
  print(paste("Warning: Found", non_numeric, "non-numeric values."))
}

rownames(cleaned_impots) <- rownames(impots)

#reactable(head(cleaned_impots, 100))

2.1.5 Commune Data

2.1.5.1 Cleaning

  • ajouter source
  • ajouter description
  • expliquer blabla

Replaces NAs in both Taux de couverture social and Political (Conseil National Datas) For Taux de couverture Social: NAs were due to reason “Q” = “Not indicated to protect confidentiality” We replaced the NAs by the average taux de couverture in Switzerland in 2019, which was 3.2%

For Political data: NAs were due to reason “M” = “Not indicated because data was not important or applicable” Therefore, we replaced the NAs by 0

Code
# il faudra changer le path
commune_prep <- read.csv(file.path(here(),"data/commune_data.csv"), sep = ";", header = TRUE, stringsAsFactors = FALSE)

# We keep only 2019 to have some reference? (2020 is apparently not really complete)
commune_2019 <- subset(commune_prep, PERIOD_REF == "2019") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE", "STATUS"))

# delete les lignes ou Status = Q ou M (pas de valeur) et ensuite on enlève la colonne
commune_2019 <- subset(commune_2019, STATUS == "A") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE"))

# on enlève les lignes qui sont des aggrégats
commune_2019 <- subset(commune_2019, REGION != "Schweiz")

commune_2019 <- commune_2019 %>%
  pivot_wider(names_from = INDICATORS, values_from = VALUE)

# Rename columns using the provided map
commune <- commune_2019 %>%
  rename(`Population - Habitants` = Ind_01_01,
         `Population - Densité de la population` = Ind_01_03,
         `Population - Etrangers` = Ind_01_08,
         `Population - Part du groupe d'âge 0-19 ans` = Ind_01_04,
         `Population - Part du groupe d'âge 20-64 ans` = Ind_01_05,
         `Population - Part du groupe d'âge 65+ ans` = Ind_01_06,
         `Population - Taux brut de nuptialité` = Ind_01_09,
         `Population - Taux brut de divortialité` = Ind_01_10,
         `Population - Taux brut de natalité` = Ind_01_11,
         `Population - Taux brut de mortalité` = Ind_01_12,
         `Population - Ménages privés` = Ind_01_13,
         `Population - Taille moyenne des ménages` = Ind_01_14,
         `Sécurité sociale - Taux d'aide sociale` = Ind_11_01,
         `Conseil national - PLR` = Ind_14_01,
         `Conseil national - PDC` = Ind_14_02,
         `Conseil national - PS` = Ind_14_03,
         `Conseil national - UDC` = Ind_14_04,
         `Conseil national - PEV/PCS` = Ind_14_05,
         `Conseil national - PVL` = Ind_14_06,
         `Conseil national - PBD` = Ind_14_07,
         `Conseil national - PST/Sol.` = Ind_14_08,
         `Conseil national - PES` = Ind_14_09,
         `Conseil national - Petits partis de droite` = Ind_14_10)

# If no one voted for a party, set as NA -> replacing it with 0 instead
commune <- commune %>%
  mutate_at(vars(starts_with("Conseil national")), ~replace_na(., 0))


# Removing NAs from Taux de couverture sociale column
# Setting the mean as the mean for Switzerland in 2019 (3.2%)
mean_taux_aide_social <- 3.2

# Replace NA values with the mean
commune <- commune %>%
  mutate(`Sécurité sociale - Taux d'aide sociale` = if_else(is.na(`Sécurité sociale - Taux d'aide sociale`), mean_taux_aide_social, `Sécurité sociale - Taux d'aide sociale`))
#show 100 first rows of commune using reactable
reactable(head(commune, 100))
Code

# commune_prep <- read.csv(file.path(here(),"data/commune_data.csv"), sep = ";", header = TRUE, stringsAsFactors = FALSE)
# 
# # We keep only 2019 to have some reference? (2020 is apparently not really complete)
# commune_2019 <- subset(commune_prep, PERIOD_REF == "2019") %>%
#   select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE", "STATUS"))
# 
# # delete les lignes ou Status = Q ou M (pas de valeur) et ensuite on enlève la colonne
# commune_2019 <- subset(commune_2019, STATUS == "A") %>%
#   select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE"))
# 
# # on enlève les lignes qui sont des aggrégats
# commune_2019 <- subset(commune_2019, REGION != "Schweiz")
# 
# commune_2019 <- commune_2019 %>%
#   pivot_wider(names_from = INDICATORS, values_from = VALUE)
# 
# # Rename columns using the provided map
# commune <- commune_2019 %>%
#   rename(`Population - Habitants` = Ind_01_01,
#          `Population - Densité de la population` = Ind_01_03,
#          `Population - Etrangers` = Ind_01_08,
#          `Population - Part du groupe d'âge 0-19 ans` = Ind_01_04,
#          `Population - Part du groupe d'âge 20-64 ans` = Ind_01_05,
#          `Population - Part du groupe d'âge 65+ ans` = Ind_01_06,
#          `Population - Taux brut de nuptialité` = Ind_01_09,
#          `Population - Taux brut de divortialité` = Ind_01_10,
#          `Population - Taux brut de natalité` = Ind_01_11,
#          `Population - Taux brut de mortalité` = Ind_01_12,
#          `Population - Ménages privés` = Ind_01_13,
#          `Population - Taille moyenne des ménages` = Ind_01_14,
#          `Sécurité sociale - Taux d'aide sociale` = Ind_11_01,
#          `Conseil national - PLR` = Ind_14_01,
#          `Conseil national - PDC` = Ind_14_02,
#          `Conseil national - PS` = Ind_14_03,
#          `Conseil national - UDC` = Ind_14_04,
#          `Conseil national - PEV/PCS` = Ind_14_05,
#          `Conseil national - PVL` = Ind_14_06,
#          `Conseil national - PBD` = Ind_14_07,
#          `Conseil national - PST/Sol.` = Ind_14_08,
#          `Conseil national - PES` = Ind_14_09,
#          `Conseil national - Petits partis de droite` = Ind_14_10)
# 
# # If no one voted for a party, set as NA -> replacing it with 0 instead
# commune <- commune %>%
#   mutate_at(vars(starts_with("Conseil national")), ~replace_na(., 0))
# 
# 
# # Removing NAs from Taux de couverture sociale column
# # Setting the mean as the mean for Switzerland in 2019 (3.2%)
# mean_taux_aide_social <- 3.2
# 
# # Replace NA values with the mean
# commune <- commune %>%
#   mutate(`Sécurité sociale - Taux d'aide sociale` = if_else(is.na(`Sécurité sociale - Taux d'aide sociale`), mean_taux_aide_social, `Sécurité sociale - Taux d'aide sociale`))
# 

3 Unsupervised learning

  • Clustering and/or dimension reduction

Trying to Cluster commune datas to: 1. Reduce dimension 2. See similarities

A regarder, est-ce qu’on fait un cluster pour les datas politques + un cluster pour les data démographiques, ou est-ce qu’on regroupe tout?

Code
set.seed(100)

# Clustering demographic
cols_commune_demographic <- select(commune, -c("REGION", "CODE_REGION","Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))

# Scale the columns, some are total numbers, some are percentages
cols_commune_demographic <- scale(cols_commune_demographic)

# Calculate the distance matrix
dist_matrix_demographic <- dist(cols_commune_demographic, method = "minkowski")

# Perform hierarchical clustering
hclust_model_demographic <- hclust(dist_matrix_demographic, method = "ward.D")

# Create dendrogram
dend_demo <- as.dendrogram(hclust_model_demographic)
dend_demo <- color_branches(dend_demo, k = 5) #Set number of cluster to 5, to keep the same scale for all our variables

plot(dend_demo, main = "Demographics - Hierarchical Clustering Dendrogram")

Code
# Clustering politics

cols_commune_politics <- select(commune, c("Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))

# Scale the columns, some are total numbers, some are percentages
cols_commune_politics <- scale(cols_commune_politics)

# Calculate the distance matrix
dist_matrix_politics <- dist(cols_commune_politics, method = "minkowski")

# Perform hierarchical clustering
hclust_model_politics <- hclust(dist_matrix_politics, method = "ward.D")

# Create dendrogram
dend_pol <- as.dendrogram(hclust_model_politics)
dend_pol <- color_branches(dend_pol, k = 5) #Set number of cluster to 5, to keep the same scale for all our variables

plot(dend_pol, main = "Politics - Hierarchical Clustering Dendrogram")

To prevent introducing 10 new types of taxes, we conducted a clustering analysis on the tax dataset to identify which municipalities can be grouped together. Based on the within-cluster sum of squares, we found 5 clusters. These 5 distinct clusters will be assigned to properties to determine which municipalities are subject to a particular type of tax. ## Tax ::: {.cell layout-align=“center”}

Code

# Clean data and convert to numeric
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))
cleaned_impots[is.na(cleaned_impots)] <- 0  # Replace NA values with 0

# Scale the features
scaled_impots <- scale(cleaned_impots)

# Perform k-means clustering
k <- 2  # Initial guess for the number of clusters
kmeans_model <- kmeans(scaled_impots, centers = k)

# Check within-cluster sum of squares (elbow method)
wss <- numeric(10)
for (i in 1:10) {
  kmeans_model <- kmeans(scaled_impots, centers = i)
  wss[i] <- sum(kmeans_model$withinss)
}
#plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# Adjust k based on elbow method
k <- 5  

# Perform k-means clustering again with optimal k
kmeans_model <- kmeans(scaled_impots, centers = k)

# Assign cluster labels to dendrogram
clusters <- kmeans_model$cluster

# Plot dendrogram
#colored_dend <- color_branches(dend, k = 5)
#y_zoom_range <- c(0, 80)  # Adjust the y-axis range as needed

#plot(colored_dend, main = "Hierarchical Clustering Dendrogram", horiz = FALSE, ylim = y_zoom_range)

:::

Code
# Get the cluster centers
cluster_centers <- kmeans_model$centers

# Create a data frame with cluster centers
cluster_centers_df <- data.frame(cluster = 1:k, cluster_centers)

# Print cluster centers
print(cluster_centers_df)
#>   cluster Coefficient.d.impôt.en.. Impôt.sur.le.revenu.Canton
#> 1       1                   -0.236                    -0.6810
#> 2       2                    0.754                    -0.5593
#> 3       3                    0.532                    -0.5667
#> 4       4                    1.037                    -0.0443
#> 5       5                   -0.986                     1.6254
#>   Impôt.sur.le.revenu.Commune Impôt.sur.la.fortune.Canton
#> 1                      -0.333                     -0.6771
#> 2                       0.427                     -0.5636
#> 3                      -0.240                     -0.5710
#> 4                      -0.904                     -0.0474
#> 5                       1.355                      1.6258
#>   Impôt.sur.la.fortune.Commune Impôt.sur.le.bénéfice.Canton
#> 1                       -0.334                      -0.6877
#> 2                        0.427                      -2.0874
#> 3                       -0.236                       0.2662
#> 4                       -0.906                       0.0122
#> 5                        1.355                       1.1969
#>   Impôt.sur.le.bénéfice.Commune Impôt.sur.le.capital.Canton
#> 1                       -0.0226                     -0.6974
#> 2                        2.3564                     -2.0637
#> 3                       -1.3419                      0.2339
#> 4                       -0.4186                     -0.0311
#> 5                        1.0201                      1.2674
#>   Impôt.sur.le.capital.Commune
#> 1                       -0.068
#> 2                        2.160
#> 3                       -1.305
#> 4                       -0.442
#> 5                        1.113

# Calculate the size of each cluster
cluster_sizes <- table(kmeans_model$cluster)

# Print cluster sizes
print(cluster_sizes)
#> 
#>   1   2   3   4   5 
#> 777  80 382 395 497

# Get the cluster labels
cluster_labels <- kmeans_model$cluster

# Convert cleaned_impots to a data frame
impots_cluster <- as.data.frame(cleaned_impots)

# Add the cluster labels to cleaned_impots
impots_cluster$cluster <- cluster_labels

rownames(impots_cluster) <- rownames(impots)

impots_cluster <- impots_cluster %>%
  rownames_to_column(var = "Community")
Code
# Preparing df_commune for merging with main dataset

df_commune <- select(commune, REGION)

df_commune$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)
df_commune$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Preparing to merge

merging <- inner_join(amto_df, df_commune, by = c("Community" = "REGION"))

impots_cluster_subset <- impots_cluster[, c("Community", "cluster")]
merging <- merging %>%
  left_join(impots_cluster_subset, by = "Community")

clusters_df <- merging %>%
  rename(Tax_cluster = cluster) %>%
  rename(Commune = Community)

clusters_df <- clusters_df %>%
  select(c("Commune", "zip_code", "Canton_code", "Demographic_cluster", "Political_cluster", "Tax_cluster"))

# Only NAs are for commune Brugg, (written Brugg (AG) in the other data set) -> j'entre le cluster à la mano
clusters_df$Tax_cluster[is.na(clusters_df$Tax_cluster)] <- 2

# adding it to our main data set:
properties_filtered <- merge(properties_filtered, clusters_df[, c("zip_code", "Demographic_cluster", "Political_cluster", "Tax_cluster")], by = "zip_code", all.x = TRUE)
Code
# Dropping 228 rows containing NAs after the merge

# Find rows with NA values in the specified columns
na_rows <- subset(properties_filtered, is.na(Demographic_cluster) | is.na(Political_cluster) | is.na(Tax_cluster))

# Drop the NA rows
properties_filtered <- anti_join(properties_filtered, na_rows, by = "zip_code")
Code
# Interpretaion of demographic clusters
demographic_vars <- select(commune, -c("REGION", "CODE_REGION", "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))

# Scale the variables
scaled_demographic_vars <- scale(demographic_vars)

# Convert to data frame
scaled_demographic_vars <- as.data.frame(scaled_demographic_vars)

# Add demographic cluster labels
scaled_demographic_vars$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)

# Melt the dataset to long format
melted_demographic <- melt(scaled_demographic_vars, id.vars = "Demographic_cluster")

# Create boxplots for each variable
for (variable in unique(melted_demographic$variable)) {
  # Calculate quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_demographic$value[melted_demographic$variable == variable], melted_demographic$Demographic_cluster[melted_demographic$variable == variable], quantile, c(0.05, 0.95))
  
  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  
  boxplot(value ~ Demographic_cluster, data = melted_demographic[melted_demographic$variable == variable,],
          main = paste("Boxplot of", variable, "by Demographic Cluster"),
          xlab = "Demographic Cluster",
          ylab = variable,
          ylim = c(ylim_min, ylim_max))
}

Code
# Subset your dataset to include only the variables used to create the political clusters and the political cluster labels
political_vars <- select(commune, c("Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))

# Scale the variables
scaled_political_vars <- scale(political_vars)

# Convert to data frame
scaled_political_vars <- as.data.frame(scaled_political_vars)

# Add political cluster labels
scaled_political_vars$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Melt the dataset to long format
melted_political <- melt(scaled_political_vars, id.vars = "Political_cluster")

# Create boxplots for each variable
for (variable in unique(melted_political$variable)) {
  # Calculate quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_political$value[melted_political$variable == variable], melted_political$Political_cluster[melted_political$variable == variable], quantile, c(0.05, 0.95))
  
  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  
  boxplot(value ~ Political_cluster, data = melted_political[melted_political$variable == variable,],
          main = paste("Boxplot of", variable, "by Political Cluster"),
          xlab = "Political Cluster",
          ylab = variable,
          ylim = c(ylim_min, ylim_max))
}

Code
# Subset your dataset to include only the variables used to create the tax clusters and the tax cluster labels
tax_vars <- select(impots_cluster, -c("Community", "cluster"))

# Scale the variables
scaled_tax_vars <- scale(tax_vars)

# Convert to data frame
scaled_tax_vars <- as.data.frame(scaled_tax_vars)

# Add tax cluster labels
scaled_tax_vars$Tax_cluster <- impots_cluster$cluster

# Melt the dataset to long format
melted_tax <- melt(scaled_tax_vars, id.vars = "Tax_cluster")

# Create boxplots for each variable
for (variable in unique(melted_tax$variable)) {
  # Calculate quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_tax$value[melted_tax$variable == variable], melted_tax$Tax_cluster[melted_tax$variable == variable], quantile, c(0.05, 0.95))
  
  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  
  boxplot(value ~ Tax_cluster, data = melted_tax[melted_tax$variable == variable,],
          main = paste("Boxplot of", variable, "by Tax Cluster"),
          xlab = "Tax Cluster",
          ylab = variable,
          ylim = c(ylim_min, ylim_max))
}

4 EDA

4.1 Map representation of distribution of properties

Here we decided to represent the distribution of properties in Switzerland using a map. The map is interactive and allows users to hover over the markers to see the price. The markers are color-coded in orange and have a semi-transparent fill to reduce visual noise. The size of the markers is smaller to optimize the visual representation of the data.

This visualization helps in understanding the geographical spread and density of properties in the dataset.

Code
# Create a leaflet map with optimized markers
map <- leaflet(properties_filtered) %>%
  addTiles() %>%  # Add default OpenStreetMap tiles
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>% # Add topographic maps for context
  addCircleMarkers(
    ~lon, ~lat,
    radius = 1.5,  # Smaller radius for the circle markers
    color = "#F97300",  # Specifying a color for the markers
    fillOpacity = 0.2,  # Semi-transparent fill
    stroke = FALSE,  # No border to the circle markers to reduce visual noise
    popup = ~paste("Price: ", price, "<br>",
                   "Rooms: ", number_of_rooms, "<br>",
                   "Type: ", property_type, "<br>",
                   "Year: ", year_category),
    label = ~paste("Price: ", price)  # Tooltip on hover
  ) %>% addLegend(
    position = "bottomright",  # Position the legend at the bottom right
    colors = "#F97300",  # Use the same color as the markers
    labels = "Properties"  # Label for the legend
  )

map$width <- "100%"  # Set the width of the map to 100%
map$height <- 600  # Set the height of the map to 600 pixels

map

The map highlights that properties are well-distributed throughout the region, with fewer properties in the Alpine region, which is expected due to its mountainous terrain. We thus have a good representation of the data across different cantons and locations and we can use it for further analysis.

4.2 Histogram of prices

Code
histogram_price <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "red") +
  labs(title = "Distribution of Prices",
       x = "Price",
       y = "Frequency") +
  theme_minimal()
# Convert ggplot object to plotly object
interactive_histogram_price <- ggplotly(histogram_price, width = 600, height = 500 )
# Display the interactive histogram
interactive_histogram_price

4.3 Price between 0 and 500000

4.3.1 Histogram of prices for each property type

Code
# Create the ggplot object
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ property_type, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Property Type",
       x = "Price",
       y = "Frequency") +
  theme_minimal() +
  xlim(0, 5000000)

# Convert ggplot object to plotly object
interactive_histogram <- ggplotly(histogram, width = 600, height = 1000)

# Display the interactive plot
interactive_histogram

4.3.2 Histogram of prices for each year category

Code
# Create a histogram of prices for each year category
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ year_category, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Year Category",
       x = "Price",
       y = "Frequency") +
  theme_minimal() +
  xlim(0, 5000000)
# Convert ggplot object to plotly object
interactive_histogram_year <- ggplotly(histogram, width = 600, height = 1000)
# Display the interactive plot
interactive_histogram_year

4.3.3 Histogram of prices for each canton

Code
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ canton, scales = "free", ncol = 2) +
  labs(title = "Distribution by Canton for properties between 0 and 5 million",
       x = "Price",
       y = "Frequency") +
  theme(axis.text.y = element_text(size = 2)) +
  theme_minimal() +
  xlim(0, 5200000)

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 600, height = 1200) #%>%
  #layout(height = 1000)  # Adjust the height as needed

# Display the interactive plot
interactive_histogram

4.3.4 Histogram of prices for each number of rooms

Code
# Create a histogram of prices for each number of rooms
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ number_of_rooms, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Number of Rooms",
       x = "Price",
       y = "Frequency") +
  theme_minimal() +
  xlim(0, 5200000)

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 600, height = 2000)%>%
  layout(height = 2000)

# Display the interactive plot
interactive_histogram

4.4 Histogram of properties by square meters

To better see the data, we only show the properties with less than 1000 square meters

Code
histogram <- ggplot(properties_filtered, aes(x = square_meters)) +
  geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Properties by Square Meters",
       x = "Square Meters",
       y = "Frequency") +
  theme_minimal() +
  xlim(0,1000)

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 600, height = 500 )  # Adjust width and height as needed

# Display the interactive plot
interactive_histogram

4.5 Histogram of prices by Tax Cluster

Code
# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Tax_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by Tax Cluster",
       x = "Tax Cluster",
       y = "Price") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 600, height = 500 )
interactive_boxplot

4.6 Histogram of prices by Political cluster

Code
# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Political_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by Political Cluster",
       x = "Political Cluster",
       y = "Price") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 600, height = 500 )
interactive_boxplot

4.7 Histogram of prices by demographic cluster

Code
# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Demographic_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by demographic Cluster",
       x = "demographic Cluster",
       y = "Price") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 600, height = 500 )
interactive_boxplot

5 Supervised learning

  • Data splitting (if a training/test set split is enough for the global analysis, at least one CV or bootstrap must be used)
  • Two or more models
  • Two or more scores
  • Tuning of one or more hyperparameters per model
  • Interpretation of the model(s)

5.1 Model 1

This section provides a comprehensive outline of the linear regression model and analysis methods employed in the study of property price determinants.

5.1.1 Tools and Software

The study was conducted using R, a programming language and environment widely recognized for its robust capabilities in statistical computing and graphics. Key packages used include:

  • corrplot for visualization of correlation matrices, aiding in preliminary feature selection. car for diagnostic testing and variance inflation factor (VIF) analysis to detect multicollinearity among predictors.
  • caret for creating training and testing sets, and managing cross-validation processes.
  • ggplot2 and plotly for creating visual representations of model diagnostics, predictions, and residuals.
  • gtsummary for creating publication-ready tables summarizing regression analysis results.

Each of these tools was chosen for its specific functionality that aids in robust data analysis, ensuring that each step of the model building and evaluation process is well-supported.

5.1.2 Linear Regression - With All Prices

5.1.2.1 Correlation Analysis - Continuous Variable

Initially, a correlation analysis was conducted to identify and visualize linear relationships between the property prices and potential predictive variables such as the number of rooms and square meters. The correlation matrix was computed and plotted using the corrplot package:

Code
correlation_matrix <- cor(properties_filtered[ , c("price", "number_of_rooms", "square_meters")], use="complete.obs")
corrplot(correlation_matrix, method="square", type="upper", tl.col="black", tl.srt=45, tl.cex=0.8, cl.cex=0.8, addCoef.col="black", number.cex=0.8, order="hclust", hclust.method="complete", tl.pos="lt", tl.offset=0.5, cl.pos="n", cl.length=1.5)

We can observe that the correlation between the number of rooms and price (0.02) and square meters and price (0.68) suggests a weak correlation with the number of rooms but a strong correlation with square meters. The number of rooms and square meters have a weak correlation (0.12), indicating they offer independent information for the model.

  • Question : How to make the correlation with categorical variables?
  • Question : Is VIF analysis really necessary and meaningful ?

5.1.2.2 Model Building

The dataset was split into training and testing sets to validate the model’s performance. The linear regression model was then fitted using selected predictors: ::: {.cell layout-align=“center”}

Code
# Model Building
## Split the Data
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p=0.8, list=FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

## Fit the Model
lm_model <- lm(price ~ number_of_rooms + square_meters + property_type + floor + year_category  + Demographic_cluster +Political_cluster + Tax_cluster , data=trainData)

summary(lm_model)
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Demographic_cluster + Political_cluster + 
#>     Tax_cluster, data = trainData)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -19094283   -362505   -111460    201785  16595539 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                    -396489      51391   -7.72  1.3e-14
#> number_of_rooms                  16483       6354    2.59  0.00949
#> square_meters                     8621        106   81.28  < 2e-16
#> property_typeAttic flat         117222      41558    2.82  0.00480
#> property_typeBifamiliar house  -214970      39424   -5.45  5.0e-08
#> property_typeChalet             189879      49946    3.80  0.00014
#> property_typeDuplex            -119296      50551   -2.36  0.01829
#> property_typeFarm house        -510786     114730   -4.45  8.6e-06
#> property_typeLoft              -158775     262250   -0.61  0.54490
#> property_typeRoof flat          -86370      58737   -1.47  0.14146
#> property_typeRustic house       -18453     232286   -0.08  0.93668
#> property_typeSingle house      -141364      23300   -6.07  1.3e-09
#> property_typeTerrace flat       -21620      83829   -0.26  0.79648
#> property_typeVilla              184138      36946    4.98  6.3e-07
#> flooreg                          19216      23857    0.81  0.42055
#> floornoteg                          NA         NA      NA       NA
#> year_category1919-1945          225546      54765    4.12  3.8e-05
#> year_category1946-1960          288880      51506    5.61  2.1e-08
#> year_category1961-1970          224755      43624    5.15  2.6e-07
#> year_category1971-1980          316943      38911    8.15  4.1e-16
#> year_category1981-1990          275729      39293    7.02  2.4e-12
#> year_category1991-2000          330105      40819    8.09  6.5e-16
#> year_category2001-2005          472269      49326    9.57  < 2e-16
#> year_category2006-2010          533813      43410   12.30  < 2e-16
#> year_category2011-2015          578988      42214   13.72  < 2e-16
#> year_category2016-2024          421772      33152   12.72  < 2e-16
#> Demographic_cluster               8870       6454    1.37  0.16930
#> Political_cluster               -47861       6305   -7.59  3.3e-14
#> Tax_cluster                      31090       5512    5.64  1.7e-08
#>                                  
#> (Intercept)                   ***
#> number_of_rooms               ** 
#> square_meters                 ***
#> property_typeAttic flat       ** 
#> property_typeBifamiliar house ***
#> property_typeChalet           ***
#> property_typeDuplex           *  
#> property_typeFarm house       ***
#> property_typeLoft                
#> property_typeRoof flat           
#> property_typeRustic house        
#> property_typeSingle house     ***
#> property_typeTerrace flat        
#> property_typeVilla            ***
#> flooreg                          
#> floornoteg                       
#> year_category1919-1945        ***
#> year_category1946-1960        ***
#> year_category1961-1970        ***
#> year_category1971-1980        ***
#> year_category1981-1990        ***
#> year_category1991-2000        ***
#> year_category2001-2005        ***
#> year_category2006-2010        ***
#> year_category2011-2015        ***
#> year_category2016-2024        ***
#> Demographic_cluster              
#> Political_cluster             ***
#> Tax_cluster                   ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 979000 on 16449 degrees of freedom
#> Multiple R-squared:  0.473,  Adjusted R-squared:  0.472 
#> F-statistic:  546 on 27 and 16449 DF,  p-value: <2e-16

:::

5.1.2.3 Model Evaluation

Diagnostic checks such as residual analysis and normality tests were conducted to validate model assumptions. Performance metrics including R-squared and RMSE were calculated to assess the model’s explanatory power and prediction accuracy.

Code
sum(is.na(testData$price))  # Check NAs in price
#> [1] 0
sapply(testData, function(x) sum(is.na(x)))  # Check NAs in all predictors
#>            zip_code               price     number_of_rooms 
#>                   0                   0                   0 
#>       square_meters             address              canton 
#>                   0                   0                   0 
#>       property_type               floor       year_category 
#>                   0                   0                   0 
#>           Community         Canton_code                 lon 
#>                   0                   0                   0 
#>                 lat Demographic_cluster   Political_cluster 
#>                   0                   0                   0 
#>         Tax_cluster 
#>                   0
  • Question : Do we need the plots ? Or can we delete them ?
Code
# Model Evaluation
## Diagnostic Checks
#plot(lm_model)
#ad.test(residuals(lm_model))

#use gt summary to show the result
tbl_reg_1 <- gtsummary::tbl_regression(lm_model)
tbl_reg_1
Characteristic Beta 95% CI1 p-value
number_of_rooms 16,483 4,030, 28,937 0.009
square_meters 8,621 8,413, 8,829 <0.001
property_type


    Apartment
    Attic flat 117,222 35,764, 198,679 0.005
    Bifamiliar house -214,970 -292,246, -137,693 <0.001
    Chalet 189,879 91,980, 287,779 <0.001
    Duplex -119,296 -218,381, -20,212 0.018
    Farm house -510,786 -735,669, -285,903 <0.001
    Loft -158,775 -672,814, 355,263 0.5
    Roof flat -86,370 -201,502, 28,761 0.14
    Rustic house -18,453 -473,759, 436,854 >0.9
    Single house -141,364 -187,034, -95,694 <0.001
    Terrace flat -21,620 -185,933, 142,693 0.8
    Villa 184,138 111,719, 256,557 <0.001
floor


    floor
    eg 19,216 -27,545, 65,978 0.4
    noteg


year_category


    0-1919
    1919-1945 225,546 118,201, 332,890 <0.001
    1946-1960 288,880 187,923, 389,837 <0.001
    1961-1970 224,755 139,248, 310,263 <0.001
    1971-1980 316,943 240,672, 393,213 <0.001
    1981-1990 275,729 198,711, 352,747 <0.001
    1991-2000 330,105 250,095, 410,115 <0.001
    2001-2005 472,269 375,585, 568,953 <0.001
    2006-2010 533,813 448,724, 618,902 <0.001
    2011-2015 578,987 496,243, 661,732 <0.001
    2016-2024 421,772 356,791, 486,752 <0.001
Demographic_cluster 8,871 -3,779, 21,520 0.2
Political_cluster -47,861 -60,220, -35,503 <0.001
Tax_cluster 31,090 20,287, 41,893 <0.001
1 CI = Confidence Interval
5.1.2.3.1 Metrics

Here are the performance metrics for the initial model: ::: {.cell layout-align=“center”}

Code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model)$r.squared
adj_r_sq <- summary(lm_model)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model, newdata=testData))
mae <- mae(testData$price, predict(lm_model, newdata=testData))
aic <- AIC(lm_model)
# show those metrics in a html table
metrics_1 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("Basic Model Performance Metrics" = 5)) 
metrics_1
Basic Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.473 0.472 965171 493713 501375

:::

5.1.2.4 Variable Selection

Stepwise regression was performed to refine the model and improve its predictive performance. The resulting model was evaluated using the same diagnostic checks and performance metrics as the initial model:

Code
# stepwise regression
lm_model2 <- step(lm_model)
#> Start:  AIC=454613
#> price ~ number_of_rooms + square_meters + property_type + floor + 
#>     year_category + Demographic_cluster + Political_cluster + 
#>     Tax_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> - floor                1  6.22e+11 1.58e+16 454612
#> - Demographic_cluster  1  1.81e+12 1.58e+16 454613
#> <none>                             1.58e+16 454613
#> - number_of_rooms      1  6.45e+12 1.58e+16 454618
#> - Tax_cluster          1  3.05e+13 1.58e+16 454643
#> - Political_cluster    1  5.53e+13 1.58e+16 454669
#> - property_type       10  1.51e+14 1.59e+16 454750
#> - year_category       10  2.73e+14 1.60e+16 454876
#> - square_meters        1  6.34e+15 2.21e+16 460175
#> 
#> Step:  AIC=454612
#> price ~ number_of_rooms + square_meters + property_type + year_category + 
#>     Demographic_cluster + Political_cluster + Tax_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> - Demographic_cluster  1  1.86e+12 1.58e+16 454612
#> <none>                             1.58e+16 454612
#> - number_of_rooms      1  6.49e+12 1.58e+16 454617
#> - Tax_cluster          1  3.04e+13 1.58e+16 454642
#> - Political_cluster    1  5.56e+13 1.58e+16 454668
#> - property_type       11  1.69e+14 1.59e+16 454765
#> - year_category       10  2.75e+14 1.60e+16 454877
#> - square_meters        1  6.33e+15 2.21e+16 460173
#> 
#> Step:  AIC=454612
#> price ~ number_of_rooms + square_meters + property_type + year_category + 
#>     Political_cluster + Tax_cluster
#> 
#>                     Df Sum of Sq      RSS    AIC
#> <none>                           1.58e+16 454612
#> - number_of_rooms    1  6.20e+12 1.58e+16 454616
#> - Tax_cluster        1  3.03e+13 1.58e+16 454641
#> - Political_cluster  1  5.44e+13 1.58e+16 454666
#> - property_type     11  1.71e+14 1.59e+16 454767
#> - year_category     10  2.74e+14 1.60e+16 454875
#> - square_meters      1  6.35e+15 2.21e+16 460181

#use gt summary to show the result 
tbl_reg_2 <- gtsummary::tbl_regression(lm_model2)
tbl_reg_3 <- tbl_merge(
  tbls= list(tbl_reg_1, tbl_reg_2),
  tab_spanner = c("**Model 1**", "**Model Stepwise**")
  )
tbl_reg_3
Characteristic Model 1 Model Stepwise
Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 16,483 4,030, 28,937 0.009 16,145 3,703, 28,587 0.011
square_meters 8,621 8,413, 8,829 <0.001 8,625 8,417, 8,833 <0.001
property_type





    Apartment

    Attic flat 117,222 35,764, 198,679 0.005 112,173 31,422, 192,924 0.006
    Bifamiliar house -214,970 -292,246, -137,693 <0.001 -220,526 -297,006, -144,047 <0.001
    Chalet 189,879 91,980, 287,779 <0.001 194,272 97,683, 290,861 <0.001
    Duplex -119,296 -218,381, -20,212 0.018 -119,952 -219,035, -20,869 0.018
    Farm house -510,786 -735,669, -285,903 <0.001 -515,797 -740,486, -291,108 <0.001
    Loft -158,775 -672,814, 355,263 0.5 -160,370 -674,402, 353,662 0.5
    Roof flat -86,370 -201,502, 28,761 0.14 -90,781 -205,417, 23,854 0.12
    Rustic house -18,453 -473,759, 436,854 >0.9 -16,391 -471,553, 438,772 >0.9
    Single house -141,364 -187,034, -95,694 <0.001 -144,225 -188,806, -99,644 <0.001
    Terrace flat -21,620 -185,933, 142,693 0.8 -21,086 -185,327, 143,154 0.8
    Villa 184,138 111,719, 256,557 <0.001 180,375 108,643, 252,107 <0.001
floor





    floor



    eg 19,216 -27,545, 65,978 0.4


    noteg





year_category





    0-1919

    1919-1945 225,546 118,201, 332,890 <0.001 224,992 117,654, 332,331 <0.001
    1946-1960 288,880 187,923, 389,837 <0.001 288,083 187,130, 389,036 <0.001
    1961-1970 224,755 139,248, 310,263 <0.001 224,236 138,731, 309,741 <0.001
    1971-1980 316,943 240,672, 393,213 <0.001 316,190 239,933, 392,447 <0.001
    1981-1990 275,729 198,711, 352,747 <0.001 274,932 197,922, 351,942 <0.001
    1991-2000 330,105 250,095, 410,115 <0.001 329,088 249,094, 409,081 <0.001
    2001-2005 472,269 375,585, 568,953 <0.001 470,492 373,863, 567,122 <0.001
    2006-2010 533,813 448,724, 618,902 <0.001 532,571 447,515, 617,627 <0.001
    2011-2015 578,987 496,243, 661,732 <0.001 577,569 494,875, 660,262 <0.001
    2016-2024 421,772 356,791, 486,752 <0.001 422,209 357,376, 487,041 <0.001
Demographic_cluster 8,871 -3,779, 21,520 0.2


Political_cluster -47,861 -60,220, -35,503 <0.001 -47,347 -59,667, -35,028 <0.001
Tax_cluster 31,090 20,287, 41,893 <0.001 30,965 20,163, 41,767 <0.001
1 CI = Confidence Interval
5.1.2.4.1 Metrics

Here we compare the performance metrics of the initial model and the stepwise model: ::: {.cell layout-align=“center”}

Code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model2)$r.squared
adj_r_sq <- summary(lm_model2)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model2, newdata=testData))
mae <- mae(testData$price, predict(lm_model2, newdata=testData))
aic <- AIC(lm_model2)
# show those metrics in a html table
metrics_2 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("Stepwise Model Performance Metrics" = 5)) 
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.473 0.472 965256 493985 501373
Code
metrics_1
Basic Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.473 0.472 965171 493713 501375

:::

5.1.2.5 Cross-Validation

Cross-validation was used to assess the model’s robustness, typically to avoid overfitting and ensure that the model generalizes well to new data., using the caret package to manage this process efficiently. The CV results show metrics like RMSE and R-squared across folds, which indicate the model’s performance stability across different subsets of the data.

Code
#add + Demographic_cluster +Political_cluster + Tax_cluster after dealing with NAN
## Cross-Validation
cv_results <- train(price ~ number_of_rooms + square_meters + year_category + property_type , data=trainData, method="lm", trControl=trainControl(method="cv", number=10))
summary(cv_results)
#> 
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -18710178   -362677   -115403    196960  16761951 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                      -458336      40280  -11.38  < 2e-16
#> number_of_rooms                    25064       6334    3.96  7.6e-05
#> square_meters                       8510        106   80.33  < 2e-16
#> `year_category1919-1945`          233395      55004    4.24  2.2e-05
#> `year_category1946-1960`          284299      51735    5.50  4.0e-08
#> `year_category1961-1970`          221613      43818    5.06  4.3e-07
#> `year_category1971-1980`          304335      39063    7.79  7.0e-15
#> `year_category1981-1990`          278800      39462    7.06  1.7e-12
#> `year_category1991-2000`          331674      40977    8.09  6.2e-16
#> `year_category2001-2005`          479359      49496    9.68  < 2e-16
#> `year_category2006-2010`          533222      43585   12.23  < 2e-16
#> `year_category2011-2015`          568439      42372   13.42  < 2e-16
#> `year_category2016-2024`          415840      33203   12.52  < 2e-16
#> `property_typeAttic flat`         115759      41377    2.80  0.00515
#> `property_typeBifamiliar house`  -198123      39137   -5.06  4.2e-07
#> property_typeChalet               173521      49288    3.52  0.00043
#> property_typeDuplex               -93790      50718   -1.85  0.06444
#> `property_typeFarm house`        -487278     115093   -4.23  2.3e-05
#> property_typeLoft                 -93595     263370   -0.36  0.72231
#> `property_typeRoof flat`          -63989      58694   -1.09  0.27564
#> `property_typeRustic house`      -118736     233077   -0.51  0.61046
#> `property_typeSingle house`      -143963      22844   -6.30  3.0e-10
#> `property_typeTerrace flat`       -32966      84163   -0.39  0.69529
#> property_typeVilla                155989      36692    4.25  2.1e-05
#>                                    
#> (Intercept)                     ***
#> number_of_rooms                 ***
#> square_meters                   ***
#> `year_category1919-1945`        ***
#> `year_category1946-1960`        ***
#> `year_category1961-1970`        ***
#> `year_category1971-1980`        ***
#> `year_category1981-1990`        ***
#> `year_category1991-2000`        ***
#> `year_category2001-2005`        ***
#> `year_category2006-2010`        ***
#> `year_category2011-2015`        ***
#> `year_category2016-2024`        ***
#> `property_typeAttic flat`       ** 
#> `property_typeBifamiliar house` ***
#> property_typeChalet             ***
#> property_typeDuplex             .  
#> `property_typeFarm house`       ***
#> property_typeLoft                  
#> `property_typeRoof flat`           
#> `property_typeRustic house`        
#> `property_typeSingle house`     ***
#> `property_typeTerrace flat`        
#> property_typeVilla              ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 984000 on 16453 degrees of freedom
#> Multiple R-squared:  0.468,  Adjusted R-squared:  0.467 
#> F-statistic:  629 on 23 and 16453 DF,  p-value: <2e-16
#show the CV result in the html
metrics_cv_1 <- kable(cv_results$results, format = "html") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("10 Fold Cross Validation Metrics" = 7))
metrics_cv_1
10 Fold Cross Validation Metrics
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 987246 0.476 506542 96757 0.061 24562
Code
metrics_1
Basic Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.473 0.472 965171 493713 501375
Code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.473 0.472 965256 493985 501373
  • Write the comparison with stepwise model, seems robust ?

5.1.2.6 Model testing

The final model was tested using the unseen test dataset to evaluate its predictive accuracy. Residual plots and actual vs. predicted price plots were created to visually assess model performance:

5.1.2.6.1 Residual vs Predicted Prices

This plot shows residuals (differences between observed and predicted prices) against predicted prices. Ideally, residuals should randomly scatter around the horizontal line at zero, indicating that the model doesn’t systematically overestimate or underestimate prices.

Code
# Model Testing 
## Test the Model
predicted_prices <- predict(lm_model2, newdata=testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment
# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual Analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

As we can observe, the spread of residuals suggests potential issues with model fit, particularly for higher price predictions where the variance seems to increase.

5.1.2.6.2 Actual vs Predicted Prices

This plot should ideally show points along the diagonal line, where actual prices equal predicted prices. The data clustering along the line suggests a generally good model fit, but here we can observe the spread which indicates variance in predictions, especially at higher price points. ::: {.cell layout-align=“center”}

Code
## Visualization
gg <- ggplot(data=testData, aes(x=predicted_prices, y=price)) +
    geom_point() +
    geom_smooth(method="lm", col="blue") +
    labs(title="Actual vs Predicted Prices", x="Predicted Prices", y="Actual Prices")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

:::

5.1.3 Linear Regression between 10th and 90th percentile

To solve this issue of variance at higher price points, we can filter the data to focus on a more specific range of prices. Here, we filter the data between the 10th and 90th percentiles of the price distribution to see if the model performs better within this range:

Code
#filter properties_filtered based on the 10th percentile and 90th percentile of the data
properties_filtered <- properties_filtered %>% 
  filter(price >= quantile(price, 0.1) & price <= quantile(price, 0.9))

5.1.3.1 Model Building

Code
# Model Building
## Split the Data
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p=0.8, list=FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

## Fit the Model
lm_model_10_90 <- lm(price ~ number_of_rooms + square_meters + property_type + floor + year_category  + Political_cluster + Tax_cluster + Demographic_cluster, data=trainData)

summary(lm_model_10_90)
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Political_cluster + Tax_cluster + 
#>     Demographic_cluster, data = trainData)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -3909026  -265248   -87537   201296  1667347 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                   409591.3    24122.6   16.98  < 2e-16
#> number_of_rooms                10456.0     3263.4    3.20   0.0014
#> square_meters                   3271.1       70.6   46.35  < 2e-16
#> property_typeAttic flat       114901.1    17702.1    6.49  8.8e-11
#> property_typeBifamiliar house 116094.3    16521.4    7.03  2.2e-12
#> property_typeChalet            99042.7    23699.5    4.18  2.9e-05
#> property_typeDuplex             7588.5    21109.7    0.36   0.7192
#> property_typeFarm house       139328.0    50212.0    2.77   0.0055
#> property_typeLoft             -69118.8    99951.6   -0.69   0.4892
#> property_typeRoof flat          3982.5    25254.9    0.16   0.8747
#> property_typeRustic house       1111.5   223386.6    0.00   0.9960
#> property_typeSingle house      74104.9    10523.0    7.04  2.0e-12
#> property_typeTerrace flat     106647.0    33804.3    3.15   0.0016
#> property_typeVilla            137796.0    17149.1    8.04  1.0e-15
#> flooreg                        24652.1    10258.1    2.40   0.0163
#> floornoteg                          NA         NA      NA       NA
#> year_category1919-1945         44135.4    25311.8    1.74   0.0812
#> year_category1946-1960         65480.6    24031.7    2.72   0.0064
#> year_category1961-1970        160967.0    20788.1    7.74  1.0e-14
#> year_category1971-1980        159924.0    18442.0    8.67  < 2e-16
#> year_category1981-1990        159944.8    18324.1    8.73  < 2e-16
#> year_category1991-2000        154760.1    18811.0    8.23  < 2e-16
#> year_category2001-2005        301789.8    22613.6   13.35  < 2e-16
#> year_category2006-2010        336142.4    19922.8   16.87  < 2e-16
#> year_category2011-2015        322787.8    19382.9   16.65  < 2e-16
#> year_category2016-2024        219954.4    15625.7   14.08  < 2e-16
#> Political_cluster             -29485.1     2704.5  -10.90  < 2e-16
#> Tax_cluster                     -597.2     2376.3   -0.25   0.8016
#> Demographic_cluster             5532.7     2869.1    1.93   0.0538
#>                                  
#> (Intercept)                   ***
#> number_of_rooms               ** 
#> square_meters                 ***
#> property_typeAttic flat       ***
#> property_typeBifamiliar house ***
#> property_typeChalet           ***
#> property_typeDuplex              
#> property_typeFarm house       ** 
#> property_typeLoft                
#> property_typeRoof flat           
#> property_typeRustic house        
#> property_typeSingle house     ***
#> property_typeTerrace flat     ** 
#> property_typeVilla            ***
#> flooreg                       *  
#> floornoteg                       
#> year_category1919-1945        .  
#> year_category1946-1960        ** 
#> year_category1961-1970        ***
#> year_category1971-1980        ***
#> year_category1981-1990        ***
#> year_category1991-2000        ***
#> year_category2001-2005        ***
#> year_category2006-2010        ***
#> year_category2011-2015        ***
#> year_category2016-2024        ***
#> Political_cluster             ***
#> Tax_cluster                      
#> Demographic_cluster           .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 386000 on 13217 degrees of freedom
#> Multiple R-squared:  0.316,  Adjusted R-squared:  0.314 
#> F-statistic:  226 on 27 and 13217 DF,  p-value: <2e-16

5.1.3.2 Model Evaluation

Code
# Model Evaluation
## Diagnostic Checks
#plot(lm_model)
#ad.test(residuals(lm_model))

#use gt summary to show the result
tbl_reg_1_10_90 <- gtsummary::tbl_regression(lm_model_10_90)

tbl_reg_1_vs_10_90 <- tbl_merge(
  tbls= list(tbl_reg_1, tbl_reg_1_10_90),
  tab_spanner = c("**Basic Model (All Prices)**", "**Basic Model (10-90 Qt)**")
  )
tbl_reg_1_vs_10_90
Characteristic Basic Model (All Prices) Basic Model (10-90 Qt)
Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 16,483 4,030, 28,937 0.009 10,456 4,059, 16,853 0.001
square_meters 8,621 8,413, 8,829 <0.001 3,271 3,133, 3,409 <0.001
property_type





    Apartment

    Attic flat 117,222 35,764, 198,679 0.005 114,901 80,202, 149,600 <0.001
    Bifamiliar house -214,970 -292,246, -137,693 <0.001 116,094 83,710, 148,479 <0.001
    Chalet 189,879 91,980, 287,779 <0.001 99,043 52,588, 145,497 <0.001
    Duplex -119,296 -218,381, -20,212 0.018 7,588 -33,790, 48,967 0.7
    Farm house -510,786 -735,669, -285,903 <0.001 139,328 40,905, 237,751 0.006
    Loft -158,775 -672,814, 355,263 0.5 -69,119 -265,038, 126,801 0.5
    Roof flat -86,370 -201,502, 28,761 0.14 3,982 -45,521, 53,486 0.9
    Rustic house -18,453 -473,759, 436,854 >0.9 1,111 -436,758, 438,981 >0.9
    Single house -141,364 -187,034, -95,694 <0.001 74,105 53,478, 94,731 <0.001
    Terrace flat -21,620 -185,933, 142,693 0.8 106,647 40,386, 172,908 0.002
    Villa 184,138 111,719, 256,557 <0.001 137,796 104,181, 171,411 <0.001
floor





    floor

    eg 19,216 -27,545, 65,978 0.4 24,652 4,545, 44,759 0.016
    noteg





year_category





    0-1919

    1919-1945 225,546 118,201, 332,890 <0.001 44,135 -5,479, 93,750 0.081
    1946-1960 288,880 187,923, 389,837 <0.001 65,481 18,375, 112,586 0.006
    1961-1970 224,755 139,248, 310,263 <0.001 160,967 120,219, 201,715 <0.001
    1971-1980 316,943 240,672, 393,213 <0.001 159,924 123,775, 196,073 <0.001
    1981-1990 275,729 198,711, 352,747 <0.001 159,945 124,027, 195,863 <0.001
    1991-2000 330,105 250,095, 410,115 <0.001 154,760 117,888, 191,632 <0.001
    2001-2005 472,269 375,585, 568,953 <0.001 301,790 257,464, 346,116 <0.001
    2006-2010 533,813 448,724, 618,902 <0.001 336,142 297,091, 375,194 <0.001
    2011-2015 578,987 496,243, 661,732 <0.001 322,788 284,795, 360,781 <0.001
    2016-2024 421,772 356,791, 486,752 <0.001 219,954 189,326, 250,583 <0.001
Demographic_cluster 8,871 -3,779, 21,520 0.2 5,533 -91, 11,157 0.054
Political_cluster -47,861 -60,220, -35,503 <0.001 -29,485 -34,786, -24,184 <0.001
Tax_cluster 31,090 20,287, 41,893 <0.001 -597 -5,255, 4,061 0.8
1 CI = Confidence Interval
5.1.3.2.1 Metrics

Here are the performance metrics for the model with prices between the 10th and 90th percentiles: ::: {.cell layout-align=“center”}

Code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model_10_90)$r.squared
adj_r_sq <- summary(lm_model_10_90)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model_10_90))
#> Warning in actual - predicted: longer object length is not a
#> multiple of shorter object length
mae <- mae(testData$price, predict(lm_model_10_90, newdata=testData))
aic <- AIC(lm_model_10_90)
# show those metrics in a html table
metrics_1_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics (10-90 Qt)" = 5))  
metrics_1_10_90
Basic Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.316 0.314 529332 295819 378362

:::

Here was the previous metrics of our first Basic model (without the 10-90 Qt filter) ::: {.cell layout-align=“center”}

Code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.473 0.472 965256 493985 501373

:::

5.1.3.3 Variable Selection

Code
# stepwise regression
lm_model2_10_90 <- step(lm_model_10_90)
#> Start:  AIC=340772
#> price ~ number_of_rooms + square_meters + property_type + floor + 
#>     year_category + Political_cluster + Tax_cluster + Demographic_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> - Tax_cluster          1  9.40e+09 1.97e+15 340770
#> <none>                             1.97e+15 340772
#> - Demographic_cluster  1  5.54e+11 1.97e+15 340774
#> - floor                1  8.60e+11 1.97e+15 340776
#> - number_of_rooms      1  1.53e+12 1.97e+15 340780
#> - property_type       10  1.04e+13 1.98e+15 340822
#> - Political_cluster    1  1.77e+13 1.99e+15 340889
#> - year_category       10  7.76e+13 2.05e+15 341265
#> - square_meters        1  3.20e+14 2.29e+15 342765
#> 
#> Step:  AIC=340770
#> price ~ number_of_rooms + square_meters + property_type + floor + 
#>     year_category + Political_cluster + Demographic_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> <none>                             1.97e+15 340770
#> - Demographic_cluster  1  5.53e+11 1.97e+15 340772
#> - floor                1  8.64e+11 1.97e+15 340774
#> - number_of_rooms      1  1.52e+12 1.97e+15 340778
#> - property_type       10  1.04e+13 1.98e+15 340820
#> - Political_cluster    1  2.05e+13 1.99e+15 340905
#> - year_category       10  7.77e+13 2.05e+15 341263
#> - square_meters        1  3.20e+14 2.29e+15 342763
# plot(lm_model2)
# ad.test(residuals(lm_model2))

lm_model2_10_90
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Political_cluster + Demographic_cluster, 
#>     data = trainData)
#> 
#> Coefficients:
#>                   (Intercept)                number_of_rooms  
#>                        407484                          10424  
#>                 square_meters        property_typeAttic flat  
#>                          3271                         114977  
#> property_typeBifamiliar house            property_typeChalet  
#>                        116098                          99476  
#>           property_typeDuplex        property_typeFarm house  
#>                          7626                         139306  
#>             property_typeLoft         property_typeRoof flat  
#>                        -69620                           3953  
#>     property_typeRustic house      property_typeSingle house  
#>                           457                          74134  
#>     property_typeTerrace flat             property_typeVilla  
#>                        106652                         137868  
#>                       flooreg                     floornoteg  
#>                         24707                             NA  
#>        year_category1919-1945         year_category1946-1960  
#>                         44044                          65475  
#>        year_category1961-1970         year_category1971-1980  
#>                        160915                         159994  
#>        year_category1981-1990         year_category1991-2000  
#>                        159960                         154844  
#>        year_category2001-2005         year_category2006-2010  
#>                        301867                         336186  
#>        year_category2011-2015         year_category2016-2024  
#>                        322734                         220053  
#>             Political_cluster            Demographic_cluster  
#>                        -29220                           5529

#use gt summary to show the result 
tbl_reg_2_10_90 <- gtsummary::tbl_regression(lm_model2_10_90)
tbl_reg_3_10_90 <- tbl_merge(
  tbls= list(tbl_reg_1_vs_10_90, tbl_reg_2_10_90),
  tab_spanner = c("**Basic Model (10-90 Qt)**", "**Stepwise Model (10-90 Qt)**")
  )
tbl_reg_3_10_90
Characteristic Basic Model (10-90 Qt) Stepwise Model (10-90 Qt)
Beta 95% CI1 p-value Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 16,483 4,030, 28,937 0.009 10,456 4,059, 16,853 0.001 10,424 4,033, 16,816 0.001
square_meters 8,621 8,413, 8,829 <0.001 3,271 3,133, 3,409 <0.001 3,271 3,133, 3,410 <0.001
property_type








    Apartment


    Attic flat 117,222 35,764, 198,679 0.005 114,901 80,202, 149,600 <0.001 114,977 80,284, 149,669 <0.001
    Bifamiliar house -214,970 -292,246, -137,693 <0.001 116,094 83,710, 148,479 <0.001 116,098 83,715, 148,481 <0.001
    Chalet 189,879 91,980, 287,779 <0.001 99,043 52,588, 145,497 <0.001 99,476 53,147, 145,806 <0.001
    Duplex -119,296 -218,381, -20,212 0.018 7,588 -33,790, 48,967 0.7 7,626 -33,749, 49,002 0.7
    Farm house -510,786 -735,669, -285,903 <0.001 139,328 40,905, 237,751 0.006 139,306 40,887, 237,725 0.006
    Loft -158,775 -672,814, 355,263 0.5 -69,119 -265,038, 126,801 0.5 -69,620 -265,493, 126,254 0.5
    Roof flat -86,370 -201,502, 28,761 0.14 3,982 -45,521, 53,486 0.9 3,953 -45,548, 53,454 0.9
    Rustic house -18,453 -473,759, 436,854 >0.9 1,111 -436,758, 438,981 >0.9 457 -437,367, 438,282 >0.9
    Single house -141,364 -187,034, -95,694 <0.001 74,105 53,478, 94,731 <0.001 74,134 53,509, 94,759 <0.001
    Terrace flat -21,620 -185,933, 142,693 0.8 106,647 40,386, 172,908 0.002 106,652 40,393, 172,911 0.002
    Villa 184,138 111,719, 256,557 <0.001 137,796 104,181, 171,411 <0.001 137,868 104,259, 171,477 <0.001
floor








    floor


    eg 19,216 -27,545, 65,978 0.4 24,652 4,545, 44,759 0.016 24,707 4,605, 44,809 0.016
    noteg








year_category








    0-1919


    1919-1945 225,546 118,201, 332,890 <0.001 44,135 -5,479, 93,750 0.081 44,044 -5,564, 93,652 0.082
    1946-1960 288,880 187,923, 389,837 <0.001 65,481 18,375, 112,586 0.006 65,475 18,371, 112,579 0.006
    1961-1970 224,755 139,248, 310,263 <0.001 160,967 120,219, 201,715 <0.001 160,915 120,171, 201,659 <0.001
    1971-1980 316,943 240,672, 393,213 <0.001 159,924 123,775, 196,073 <0.001 159,994 123,850, 196,137 <0.001
    1981-1990 275,729 198,711, 352,747 <0.001 159,945 124,027, 195,863 <0.001 159,960 124,044, 195,876 <0.001
    1991-2000 330,105 250,095, 410,115 <0.001 154,760 117,888, 191,632 <0.001 154,844 117,979, 191,709 <0.001
    2001-2005 472,269 375,585, 568,953 <0.001 301,790 257,464, 346,116 <0.001 301,867 257,547, 346,187 <0.001
    2006-2010 533,813 448,724, 618,902 <0.001 336,142 297,091, 375,194 <0.001 336,186 297,138, 375,235 <0.001
    2011-2015 578,987 496,243, 661,732 <0.001 322,788 284,795, 360,781 <0.001 322,734 284,744, 360,724 <0.001
    2016-2024 421,772 356,791, 486,752 <0.001 219,954 189,326, 250,583 <0.001 220,053 189,435, 250,671 <0.001
Demographic_cluster 8,871 -3,779, 21,520 0.2 5,533 -91, 11,157 0.054 5,529 -95, 11,152 0.054
Political_cluster -47,861 -60,220, -35,503 <0.001 -29,485 -34,786, -24,184 <0.001 -29,220 -34,101, -24,339 <0.001
Tax_cluster 31,090 20,287, 41,893 <0.001 -597 -5,255, 4,061 0.8


1 CI = Confidence Interval
5.1.3.3.1 Metrics

Here are the performance metrics for the stepwise model with prices between the 10th and 90th percentiles as well as the comparison with the initial model: ::: {.cell layout-align=“center”}

Code
## Performance Metrics
r_sq <- summary(lm_model2_10_90)$r.squared
adj_r_sq <- summary(lm_model2_10_90)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model2_10_90))
#> Warning in actual - predicted: longer object length is not a
#> multiple of shorter object length
mae <- mae(testData$price, predict(lm_model2_10_90, newdata=testData))
aic <- AIC(lm_model2_10_90)
# show those metrics in a html table
metrics_2_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics (10-90 Qt)" = 5))
metrics_2_10_90
Stepwise Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.316 0.314 529347 295810 378360

:::

Here was the previous metrics of our Basic Model (without Stepwise Integration) ::: {.cell layout-align=“center”}

Code
metrics_1_10_90
Basic Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.316 0.314 529332 295819 378362

:::

Here was the previous metrics of our stepwise model (without the 10-90 Qt filter) ::: {.cell layout-align=“center”}

Code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.473 0.472 965256 493985 501373

:::

5.1.3.4 Cross-Validation

Code
## Cross-Validation
cv_results2 <- train(price ~ number_of_rooms + square_meters + year_category + property_type, data=trainData, method="lm", trControl=trainControl(method="cv", number=10))
summary(cv_results2)
#> 
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -3780853  -265807   -92952   203954  1605843 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                     339511.6    20122.3   16.87  < 2e-16
#> number_of_rooms                  13722.4     3263.3    4.21  2.6e-05
#> square_meters                     3188.0       70.5   45.20  < 2e-16
#> `year_category1919-1945`         45021.5    25441.5    1.77   0.0768
#> `year_category1946-1960`         60596.7    24157.5    2.51   0.0121
#> `year_category1961-1970`        158331.7    20893.4    7.58  3.7e-14
#> `year_category1971-1980`        153564.3    18527.9    8.29  < 2e-16
#> `year_category1981-1990`        158894.6    18422.1    8.63  < 2e-16
#> `year_category1991-2000`        156792.6    18907.8    8.29  < 2e-16
#> `year_category2001-2005`        304972.5    22729.3   13.42  < 2e-16
#> `year_category2006-2010`        336133.4    20026.6   16.78  < 2e-16
#> `year_category2011-2015`        316724.0    19475.4   16.26  < 2e-16
#> `year_category2016-2024`        219396.1    15677.8   13.99  < 2e-16
#> `property_typeAttic flat`       112262.6    17627.7    6.37  2.0e-10
#> `property_typeBifamiliar house` 118815.5    16405.6    7.24  4.7e-13
#> property_typeChalet             100036.5    23552.3    4.25  2.2e-05
#> property_typeDuplex              20502.5    21189.4    0.97   0.3333
#> `property_typeFarm house`       152275.1    50417.0    3.02   0.0025
#> property_typeLoft               -48313.3   100443.0   -0.48   0.6305
#> `property_typeRoof flat`         10763.2    25251.9    0.43   0.6699
#> `property_typeRustic house`     -27078.8   224539.8   -0.12   0.9040
#> `property_typeSingle house`      70341.2    10342.6    6.80  1.1e-11
#> `property_typeTerrace flat`     107479.0    33974.6    3.16   0.0016
#> property_typeVilla              124013.8    17069.0    7.27  3.9e-13
#>                                    
#> (Intercept)                     ***
#> number_of_rooms                 ***
#> square_meters                   ***
#> `year_category1919-1945`        .  
#> `year_category1946-1960`        *  
#> `year_category1961-1970`        ***
#> `year_category1971-1980`        ***
#> `year_category1981-1990`        ***
#> `year_category1991-2000`        ***
#> `year_category2001-2005`        ***
#> `year_category2006-2010`        ***
#> `year_category2011-2015`        ***
#> `year_category2016-2024`        ***
#> `property_typeAttic flat`       ***
#> `property_typeBifamiliar house` ***
#> property_typeChalet             ***
#> property_typeDuplex                
#> `property_typeFarm house`       ** 
#> property_typeLoft                  
#> `property_typeRoof flat`           
#> `property_typeRustic house`        
#> `property_typeSingle house`     ***
#> `property_typeTerrace flat`     ** 
#> property_typeVilla              ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 388000 on 13221 degrees of freedom
#> Multiple R-squared:  0.308,  Adjusted R-squared:  0.307 
#> F-statistic:  256 on 23 and 13221 DF,  p-value: <2e-16

#show the CV result in the html
metrics_cv2 <- kable(cv_results2$results, format = "html") %>%
  
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("10 Fold Cross Validation Metrics (10-90th Qt)" = 7))
metrics_cv2
10 Fold Cross Validation Metrics (10-90th Qt)
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 388517 0.307 3e+05 13790 0.032 6244

Here was the previous metrics of our first Basic Model (without the 10-90 Qt filter) ::: {.cell layout-align=“center”}

Code
metrics_cv_1
10 Fold Cross Validation Metrics
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 987246 0.476 506542 96757 0.061 24562

:::

5.1.3.5 Model testing

5.1.3.5.1 Residual vs Predicted Prices
Code
# Model Testing 
## Test the Model
predicted_prices <- predict(lm_model2_10_90, newdata=testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment
# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual Analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p
5.1.3.5.2 Actual vs Predicted Prices
Code
## Visualization
gg <- ggplot(data=testData, aes(x=predicted_prices, y=price)) +
    geom_point() +
    geom_smooth(method="lm", col="blue") +
    labs(title="Actual vs Predicted Prices", x="Predicted Prices", y="Actual Prices")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

5.1.4 Improvement

  • Hyperparameter Tuning: Introduce more explicit tuning, perhaps through regularization techniques like Ridge or Lasso if moving beyond basic linear regression. In caret, this can be managed through the expand.grid function to set up a grid of possible values for various parameters.

5.2 Model 2

5.2.1 Random Forest

We decided to implement a Random Forest model to compare its performance with the linear regression model. Random Forest is an ensemble learning method that builds multiple decision trees during training and outputs the mode of the classes or the mean prediction of the individual trees. This model is known for its robustness and ability to handle complex relationships in the data.

5.2.1.1 Model Building

Code
#split the data in training and test set 0.8/0.2
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p=0.8, list=FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

6 Conclusion

  • Brief summary of the project
  • Take home message
  • Limitations
  • Future work?